Today, we will focus on detecting violations of the standard linear regression assumptions, such as normality, homoscedasticity, and linearity. We describe methods to address these violations through data transformations and other remedies. In addition to testing these model assumptions, we will discuss other data problems that adversely affect regression results: outliers, influential observations, and multicollinearity.

The following assumptions, in particular about the random errors, underlie the regression model:

  1. Normality: The errors are normally distributed.
  2. Homoscedasticity: The errors have a constant variance.
  3. No outliers: No observations deviate significantly from the specified model.
  4. Linearity: The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
  5. Independence: The errors are statistically independent.

If you have not installed the packages, please run the following command:

#install.packages(c("dplyr", "MASS", "car"))

We start by loading the libraries:

library(dplyr)
library(MASS)
library(car)
rm(list = ls())
set.seed(40)

We import the data, convert the categorical columns to factors, and scale the numeric columns (all except charges):

data <- read.csv("insurance.csv")
data$gender <- as.factor(data$gender)
data$gender <- relevel(data$gender, ref = "male")
data$smoker <- as.factor(data$smoker)
data$smoker <- relevel(data$smoker, ref = "no")
data$region <- as.factor(data$region)
data <- data %>% mutate_if(function(col) is.numeric(col) & !all(col == .$charges), scale)

Now, we create the regression model:

model <- lm(charges ~ age + bmi + children + gender + smoker + region, data)
summary(model)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + gender + smoker + 
##     region, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8922.2      388.9  22.939  < 2e-16 ***
## age               3608.8      167.2  21.587  < 2e-16 ***
## bmi               2068.5      174.4  11.860  < 2e-16 ***
## children           573.2      166.1   3.451 0.000577 ***
## genderfemale       131.3      332.9   0.394 0.693348    
## smokeryes        23848.5      413.2  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.1      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Exercise instructions

You will run the following commands using the Wine Quality Data Set. The goal is to predict a wine’s quality based on its properties. The data come from [Cortez et al., 2009].

wine.data <- read.csv("winequality-red.csv")
wine.data <- wine.data %>% mutate_if(function(col) is.numeric(col) & !all(col == .$quality), scale)
wine.model <- lm(quality ~ ., wine.data)
summary(wine.model)
## 
## Call:
## lm(formula = quality ~ ., data = wine.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.63602    0.01621 347.788  < 2e-16 ***
## fixed_acidity         0.04351    0.04518   0.963   0.3357    
## volatile_acidity     -0.19403    0.02168  -8.948  < 2e-16 ***
## citric_acid          -0.03556    0.02867  -1.240   0.2150    
## residual_sugar        0.02303    0.02115   1.089   0.2765    
## chlorides            -0.08821    0.01973  -4.470 8.37e-06 ***
## free_sulfur_dioxide   0.04562    0.02271   2.009   0.0447 *  
## total_sulfur_dioxide -0.10739    0.02397  -4.480 8.00e-06 ***
## density              -0.03375    0.04083  -0.827   0.4086    
## pH                   -0.06386    0.02958  -2.159   0.0310 *  
## sulphates             0.15533    0.01938   8.014 2.13e-15 ***
## alcohol               0.29433    0.02822  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

1. Checking normality

For any fixed value of X, Y is normally distributed.

We need to check the errors between the estimated and actual Ys. To test the normality assumption on the errors, the recommended method is the normal quantile-quantile plot (called the normal Q-Q plot, or simply the normal plot) of the residuals. This is a plot of the quantiles of the residuals versus the quantiles of the theoretical standard normal distribution. If both sets of quantiles come from the same distribution, the points should form a roughly straight line.

plot(model, which=2)

We see that the normal plot for `charges` shows extreme departures in the upper tail. This plot reveals the presence of outliers, which cause the non-normality.

hist(data$charges)

The data are highly concentrated on the left side of the plot. A log transformation might help mitigate the outliers.

hist(log(data$charges))

Now, log(charges) looks more normally distributed. We can apply a logarithmic transformation in the model to address this issue and mitigate the outliers.

log.model <- lm(log(charges) ~ age + bmi + children + gender + smoker + region, data)
plot(log.model, which=2)

Comparing the two, the normal plot for charges shows more extreme departures in the upper tail than the normal plot for log(charges) does. If these outliers were excluded, the normal plots would be even more linear.

Exercise 1

Check the Q-Q plot of the wine.model object. Should we transform the data?

plot(wine.model, which=2)

Print the histogram of the quality variable

hist(wine.data$quality)

2. Checking Homoscedasticity

The variance of the residuals is the same for any value of X.

One assumption of linear regression is that the variance of the errors is the same across observations and, in particular, does not depend on the values of the explanatory variables. Violation of the homoscedasticity assumption is a more serious problem than non-normality and can lead to invalid inferences on the regression coefficients.

Heteroscedasticity occurs more often in datasets that have a large range between the largest and smallest observed values. While there are numerous reasons why heteroscedasticity can exist, a common explanation is that the error variance changes proportionally with a factor. In some cases, the variance increases proportionally with this factor but remains constant as a percentage. For instance, a 10% change in a small number such as 100 is much smaller in absolute terms than a 10% change in a large number such as 100,000. In this scenario, we expect to see larger residuals associated with higher values.

To test homoscedasticity, we plot the raw residuals against the fitted values. This plot is called the residuals-vs-fitted plot. If the residuals spread out evenly, forming a roughly parallel band around the zero line, the error’s standard deviation is constant, supporting the homoscedasticity assumption.

plot(model, which=1)

Box-Cox Transformation

To address the heteroscedasticity problem, we transform the dependent variable (Y) toward a normal shape. To determine which transformation the dependent variable should take, we use the Box-Cox transformation. Based on a specific value \(\lambda\), we determine the required transformation for the dependent variable:
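For a positive response \(y\), the Box-Cox transformation with parameter \(\lambda\) is defined as:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\log(y), & \lambda = 0
\end{cases}
\]

The logarithm is the limiting case as \(\lambda \to 0\), which is why a \(\lambda\) estimate near zero supports a log transformation.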

As a result, the dependent variable can require a logarithmic, square root, or inverse transformation. The following table shows the most common Box-Cox transformations:

  \(\lambda = -2\): inverse square (\(1/y^2\))
  \(\lambda = -1\): inverse (\(1/y\))
  \(\lambda = -0.5\): inverse square root (\(1/\sqrt{y}\))
  \(\lambda = 0\): logarithm (\(\log(y)\))
  \(\lambda = 0.5\): square root (\(\sqrt{y}\))
  \(\lambda = 1\): no transformation (\(y\))
  \(\lambda = 2\): square (\(y^2\))

R automatically plots the log-likelihood as a function of possible \(\lambda\) values. The plot indicates the value that maximizes the log-likelihood, as well as a confidence interval for that value.

boxcox(model, plotit = TRUE)

The value is very close to zero (0), so the logarithmic transformation is valid: it keeps the residuals’ variance constant across the values of the independent variables.

If we want a more precise estimate of \(\lambda\)…

boxcox(model, lambda = seq(0, 0.2, by = 0.05), plotit = TRUE)
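Rather than reading \(\lambda\) off the plot, we can extract the value that maximizes the log-likelihood numerically; with plotit = FALSE, boxcox() returns the grid of \(\lambda\) values and the corresponding log-likelihoods. A minimal sketch, reusing the model object fitted above:

```r
# boxcox() returns a list with the lambda grid (x) and log-likelihoods (y)
bc <- boxcox(model, lambda = seq(0, 0.2, by = 0.01), plotit = FALSE)
# Lambda value that maximizes the log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]
lambda.hat
```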

The maximum is reached around \(\lambda = 0.14\), so the transformation for charges should be \((y ^{0.14} - 1)/0.14\). We can run this new transformation and observe the results.

boxcox.model = lm((((charges ^ 0.14) - 1) / 0.14) ~ age + gender + bmi + children + smoker + region, data)
summary(boxcox.model)
## 
## Call:
## lm(formula = (((charges^0.14) - 1)/0.14) ~ age + gender + bmi + 
##     children + smoker + region, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4148 -0.7527 -0.2352  0.1724  7.7198 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     17.59378    0.10043 175.187  < 2e-16 ***
## age              1.64292    0.04317  38.060  < 2e-16 ***
## genderfemale     0.22834    0.08597   2.656 0.008000 ** 
## bmi              0.33136    0.04503   7.358 3.25e-13 ***
## children         0.39012    0.04289   9.095  < 2e-16 ***
## smokeryes        5.81891    0.10668  54.546  < 2e-16 ***
## regionnorthwest -0.21254    0.12298  -1.728 0.084166 .  
## regionsoutheast -0.51838    0.12360  -4.194 2.92e-05 ***
## regionsouthwest -0.43433    0.12341  -3.519 0.000447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.565 on 1329 degrees of freedom
## Multiple R-squared:  0.7761, Adjusted R-squared:  0.7748 
## F-statistic: 575.9 on 8 and 1329 DF,  p-value: < 2.2e-16

We can plot the residuals against the fitted values.

plot(boxcox.model, which=1)

Exercise 2.

Plot the residuals vs fitted values of the wine.model. Use the function plot with the argument which=1. What can you see?

plot(wine.model, which=1)

Run the Box-Cox procedure to check which transformation is required.

boxcox(wine.model, plotit = TRUE)

The value is close to one (1), which means that no transformation is required (\((y^1-1)/1 \sim y\)).

3. Checking Outliers

Outliers are observations that deviate significantly from the fitted model, e.g., by more than two or three standard deviations. An outlier has a large residual (the distance between the predicted value and the observed value). Outliers worsen the fit of a statistical model because they do not coincide with the model’s predictions.

Outliers should not be deleted without additional inspection. First, they must be checked for validity and should be deleted only if they are erroneous. If they are valid observations, they may indicate model misspecification. For example, we may be fitting a straight line to data that actually follow a quadratic or exponential model. Thus, an outlier may be useful for revealing a misspecified model.

Let’s check first the original model’s residuals.

model.res = resid(model)

We can plot the residuals of each observation.

plot(c(1:nrow(data)), model.res, ylab="Residuals", xlab="Observation", main="Residual Plot") 

Now, we check the residuals of the logarithmic model.

log.model.res = resid(log.model)

We plot each observation’s residual.

plot(c(1:nrow(data)), log.model.res, ylab="Residuals", xlab="Observation", main="Residual Plot") 

A rule of thumb is to flag as potential outliers observations whose standardized residuals exceed 2 in absolute value. We check which observations have residuals larger than 2 in absolute value.

log.model.res[abs(log.model.res) > 2]
##      220      431      517     1028 
## 2.116146 2.120903 2.166365 2.043193

These observations are outliers in the right tail of the charges distribution. We can check this with the Q-Q plot too.

plot(log.model, which=2,cex.id = 2)
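Note that resid() returns raw residuals. To apply the rule of thumb on standardized residuals exactly as stated, we could use rstandard() instead; a minimal sketch, reusing the log.model fitted above:

```r
# Internally studentized (standardized) residuals of the log model
log.model.std.res <- rstandard(log.model)
# Indices of observations whose standardized residuals exceed 2 in absolute value
which(abs(log.model.std.res) > 2)
```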

Exercise 3.

Check for outliers in the wine.data. Calculate and assign to the variable wine.model.residuals the residuals of the model wine.model using the function resid(). Then, plot the residuals against the wine.data observations.

wine.model.residuals = resid(wine.model)
plot(c(1:nrow(wine.data)), wine.model.residuals, ylab="Residuals", xlab="Observation", main="Residual Plot") 

Check for potential outliers using the rule of thumb (std. residuals bigger than 2). How many potential outliers do you see?

wine.model.residuals[abs(wine.model.residuals) > 2]
##        46       391       460       653       833       900      1277      1479 
## -2.000036  2.024975 -2.061787 -2.474653 -2.689107 -2.137879 -2.327958 -2.132611 
##      1506 
## -2.223015

Identifying Influential Observations

The idea of fitting a model is to capture the overall pattern of variation in the response variable as a function of the predictor variables. The fit of the model should therefore be determined by the majority of the data and not by a few so-called influential observations (often, though not always, high-leverage observations).

An influential observation is any observation that has a large effect on the slope of a regression line fitting the data. They are generally extreme values. The process to identify an influential observation begins by removing the suspected influential point from the data set. If this removal significantly changes the slope of the regression line, then the point is considered an influential observation.

A common measure of influence is Cook’s distance, a function of the standardized residuals. Cook’s distance for the i-th observation measures the effect of deleting that observation on the fitted values of all observations.
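For a model with \(p\) estimated coefficients and residual variance estimate \(\hat{\sigma}^2\), Cook’s distance for observation \(i\) can be written as:

\[
D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,\hat{\sigma}^2}
\]

where \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) when observation \(i\) is removed from the fit.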

plot(log.model, which=4)

A rule of thumb is to remove observations with a Cook’s distance higher than 4/(number of observations).

log.model.cook.distances <- cooks.distance(log.model)
influential_obs <- log.model.cook.distances[(log.model.cook.distances) > 4/nrow(data)]
influential_obs <- as.numeric(names(influential_obs))

We create the model without the influential observations

data.without.influential <- data[-influential_obs,]
log.model.no.outliers <- lm(log(charges) ~ age + bmi + children + gender + smoker + region, data.without.influential)

We plot the distribution of these values without the influential observations.

hist(log(data.without.influential$charges))

We print the model’s results

summary(log.model.no.outliers)
## 
## Call:
## lm(formula = log(charges) ~ age + bmi + children + gender + smoker + 
##     region, data = data.without.influential)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84090 -0.10713  0.00638  0.07927  1.15633 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.734615   0.018169 480.741  < 2e-16 ***
## age              0.571294   0.007954  71.824  < 2e-16 ***
## bmi              0.057434   0.008209   6.997 4.30e-12 ***
## children         0.135800   0.007738  17.550  < 2e-16 ***
## genderfemale     0.091397   0.015467   5.909 4.45e-09 ***
## smokeryes        1.600646   0.019773  80.951  < 2e-16 ***
## regionnorthwest -0.077126   0.022220  -3.471 0.000536 ***
## regionsoutheast -0.141325   0.022318  -6.332 3.38e-10 ***
## regionsouthwest -0.116561   0.022200  -5.250 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2708 on 1227 degrees of freedom
## Multiple R-squared:  0.9096, Adjusted R-squared:  0.909 
## F-statistic:  1543 on 8 and 1227 DF,  p-value: < 2.2e-16

We check once again the Q-Q plot

plot(log.model.no.outliers, which=2)

In summary, the model improved (\(R^2 = 0.91\)) and the coefficients are all significant.

Exercise 4

Identify any influential observations in the wine.data dataframe by calculating Cook’s distances. Start by plotting the graph.

plot(wine.model, which=4)

Calculate Cook’s distance values of each observation of the model wine.model using the function cooks.distance(). Then, use the rule of thumb to identify the influential observations.

wine.model.cook.distances <- cooks.distance(wine.model)
influential_obs <- wine.model.cook.distances[(wine.model.cook.distances) > 4/nrow(wine.data)]
influential_obs <- as.numeric(names(influential_obs))

Then, remove the influential observations from wine.data and create a new model.

wine.data.without.influential <- wine.data[-influential_obs,]
wine.model.no.outliers <- lm(quality ~ ., wine.data.without.influential)
summary(wine.model.no.outliers)
## 
## Call:
## lm(formula = quality ~ ., data = wine.data.without.influential)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67161 -0.36353 -0.05916  0.39741  1.85537 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.65496    0.01439 393.023  < 2e-16 ***
## fixed_acidity         0.10537    0.04270   2.468   0.0137 *  
## volatile_acidity     -0.17616    0.02063  -8.539  < 2e-16 ***
## citric_acid          -0.06540    0.02640  -2.477   0.0134 *  
## residual_sugar        0.05231    0.02150   2.433   0.0151 *  
## chlorides            -0.08379    0.01966  -4.262 2.16e-05 ***
## free_sulfur_dioxide   0.02744    0.02089   1.314   0.1891    
## total_sulfur_dioxide -0.10657    0.02257  -4.722 2.55e-06 ***
## density              -0.06528    0.03807  -1.715   0.0866 .  
## pH                   -0.02644    0.02733  -0.967   0.3335    
## sulphates             0.18965    0.01955   9.701  < 2e-16 ***
## alcohol               0.27940    0.02615  10.684  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5553 on 1491 degrees of freedom
## Multiple R-squared:  0.4272, Adjusted R-squared:  0.423 
## F-statistic: 101.1 on 11 and 1491 DF,  p-value: < 2.2e-16

We check once again the Q-Q plot

plot(wine.model.no.outliers, which=2)

Much better!

4. Model specification

The most common type of misspecification is non-linearity. We can check whether the model is correctly specified by plotting the dependent variable against each independent variable. Since the model was built using several independent variables, we need to control for the other independent variables while varying the one under inspection. To do this, we use added-variable plots. These plots display the relationship between the dependent variable and one independent variable in the regression model, while holding the other independent variables constant. They are also called partial regression plots.

To create added-variable plots in R, we can use the avPlot() function from the car package, once for each numeric variable:

avPlot(log.model.no.outliers,"age")

avPlot(log.model.no.outliers,"bmi")

avPlot(log.model.no.outliers,"children")

The x-axis displays a single predictor variable and the y-axis displays the response variable. The blue line shows the association between the predictor variable and the response variable, while holding the value of all other predictor variables constant. The points that are labelled in each plot represent the two observations with the largest residuals and the two observations with the largest partial leverage.

After checking the plots for the logarithmic model, we can clearly see a linear relationship between the numeric variables and the dependent variable log(charges).

5. Multicollinearity Diagnostics

Multicollinearity occurs when independent variables in a regression model are correlated. In other words, one independent variable can be used to predict another. This correlation is a problem because independent variables should be independent; the redundant information skews the results of the regression model. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently because the independent variables tend to change in unison.

Some examples of correlated independent variables are: a person’s height and weight, the age and sale price of a car, or years of education and annual income. Multicollinearity causes the following two basic types of problems:

  1. The coefficient estimates become unstable: they can swing widely with small changes in the model or the data, and may even flip sign.
  2. The standard errors of the coefficients are inflated, reducing the precision of the estimates and the statistical power of the significance tests.

Variance Inflation Factors (VIF)

The variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation. A common rule of thumb is to declare multicollinearity if most of the VIFs are larger than 10.
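For the \(j\)-th predictor, the VIF is computed from the \(R_j^2\) of an auxiliary regression of \(x_j\) on all the other predictors:

\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]

so \(\mathrm{VIF}_j = 10\) corresponds to \(R_j^2 = 0.9\). Note that when a model contains factors with more than one degree of freedom, car::vif() reports a generalized VIF (GVIF) instead.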

car::vif(log.model.no.outliers)
##              GVIF Df GVIF^(1/(2*Df))
## age      1.030850  1        1.015308
## bmi      1.123649  1        1.060023
## children 1.005972  1        1.002982
## gender   1.007666  1        1.003826
## smoker   1.014312  1        1.007130
## region   1.101291  3        1.016211

In this example, the VIF scores are all lower than 2.0. Therefore, there are no signs of multicollinearity in this model.

Exercise 5

Check whether the wine.model.no.outliers has multicollinearity.

car::vif(wine.model.no.outliers)
##        fixed_acidity     volatile_acidity          citric_acid 
##             8.062905             1.891349             3.261109 
##       residual_sugar            chlorides  free_sulfur_dioxide 
##             1.681510             1.432694             2.053849 
## total_sulfur_dioxide              density                   pH 
##             2.301593             6.464433             3.343257 
##            sulphates              alcohol 
##             1.435240             3.145932

Resources